Exploratory Data Analysis:

2017 Assessed Value of Residential Property in Madison, WI

by Senay Goitom

==========================================================

Univariate Plots Section

##   [1] "ï..OBJECTID"            "Parcel"                
##   [3] "XRefParcel"             "Address"               
##   [5] "DateParcelChanged"      "PropertyClass"         
##   [7] "PropertyUse"            "AssessmentArea"        
##   [9] "AreaName"               "MoreThanOneBuild"      
##  [11] "HomeStyle"              "DwellingUnits"         
##  [13] "Stories"                "YearBuilt"             
##  [15] "Bedrooms"               "FullBaths"             
##  [17] "HalfBaths"              "TotalLivingArea"       
##  [19] "FirstFloor"             "SecondFloor"           
##  [21] "ThirdFloor"             "AboveThirdFloor"       
##  [23] "FinishedAttic"          "Basement"              
##  [25] "FinishedBasement"       "ExteriorWall1"         
##  [27] "ExteriorWall2"          "Fireplaces"            
##  [29] "CentralAir"             "PartialAssessed"       
##  [31] "AssessedByState"        "CurrentLand"           
##  [33] "CurrentImpr"            "CurrentTotal"          
##  [35] "PreviousLand"           "PreviousImpr"          
##  [37] "PreviousTotal"          "NetTaxes"              
##  [39] "SpecialAssmnt"          "OtherCharges"          
##  [41] "TotalTaxes"             "LotSize"               
##  [43] "Zoning1"                "Zoning2"               
##  [45] "Zoning3"                "Zoning4"               
##  [47] "FrontageFeet"           "FrontageStreet"        
##  [49] "WaterFrontage"          "TIFDistrict"           
##  [51] "TaxSchoolDist"          "AttendanceSchool"      
##  [53] "ElementarySchool"       "MiddleSchool"          
##  [55] "HighSchool"             "Ward"                  
##  [57] "StateAssemblyDistrict"  "RefuseDistrict"        
##  [59] "RefuseURL"              "PreviousLand2"         
##  [61] "PreviousImpr2"          "PreviousTotal2"        
##  [63] "AlderDistrict"          "AssessmentChangeDate"  
##  [65] "BlockNumber"            "BuildingDistrict"      
##  [67] "CapitolFireDistrict"    "CensusTract"           
##  [69] "ConditionalUse"         "CouncilHold"           
##  [71] "DateAdded"              "DeedPage"              
##  [73] "DeedRestriction"        "DeedVolume"            
##  [75] "ElectricalDistrict"     "EnvHealthDistrict"     
##  [77] "ExemptionType"          "FireDistrict"          
##  [79] "FloodPlain"             "FuelStorageProximity"  
##  [81] "HeatingDistrict"        "Holds"                 
##  [83] "IllegalLandDivision"    "LandfillProximity"     
##  [85] "LandfillRemediation"    "Landmark"              
##  [87] "LandscapeBuffer"        "LocalHistoricalDist"   
##  [89] "LotDepth"               "LotNumber"             
##  [91] "LotteryCredit"          "LotType1"              
##  [93] "LotType2"               "LotWidth"              
##  [95] "MCDCode"                "NationalHistoricalDist"
##  [97] "NeighborhoodDesc"       "NeighborhoodPrimary"   
##  [99] "NeighborhoodSub"        "NeighborhoodVuln"      
## [101] "NoiseAirport"           "NoiseRailroad"         
## [103] "NoiseStreet"            "ObsoleteDate"          
## [105] "OwnerChangeDate"        "OwnerOccupied"         
## [107] "ParcelChangeDate"       "ParcelCode"            
## [109] "ParkProximity"          "Pending"               
## [111] "PlanningDistrict"       "PlumbingDistrict"      
## [113] "PoliceDistrict"         "PoliceSector"          
## [115] "PreviousClass"          "PropertyUseCode"       
## [117] "RailroadFrontage"       "ReasonChangeImpr"      
## [119] "ReasonChangeLand"       "SenateDistrict"        
## [121] "SupervisorDistrict"     "TifImpr"               
## [123] "TifLand"                "TifYear"               
## [125] "TotalDwellingUnits"     "TrafficAnalysisZone"   
## [127] "TypeWaterFrontage"      "UWPolice"              
## [129] "WetlandInfo"            "ZoningAll"             
## [131] "ZoningBoardAppeal"      "UrbanDesignDistrict"   
## [133] "HouseNbr"               "StreetDir"             
## [135] "StreetName"             "StreetType"            
## [137] "Unit"                   "StreetID"              
## [139] "StormOutfall"           "FireDemandZone"        
## [141] "FireDemandSubZone"      "PropertyChangeDate"    
## [143] "MaxConstructionYear"    "XCoord"                
## [145] "YCoord"                 "SHAPESTArea"           
## [147] "SHAPESTLength"
##            ï..OBJECTID                 Parcel             XRefParcel 
##              "integer"              "numeric"              "numeric" 
##                Address      DateParcelChanged          PropertyClass 
##               "factor"               "factor"               "factor" 
##            PropertyUse         AssessmentArea               AreaName 
##               "factor"              "integer"               "factor" 
##       MoreThanOneBuild              HomeStyle          DwellingUnits 
##               "factor"              "logical"              "integer" 
##                Stories              YearBuilt               Bedrooms 
##              "numeric"              "integer"              "integer" 
##              FullBaths              HalfBaths        TotalLivingArea 
##              "integer"              "integer"              "integer" 
##             FirstFloor            SecondFloor             ThirdFloor 
##              "integer"              "integer"              "integer" 
##        AboveThirdFloor          FinishedAttic               Basement 
##              "integer"              "integer"              "integer" 
##       FinishedBasement          ExteriorWall1          ExteriorWall2 
##              "integer"               "factor"               "factor" 
##             Fireplaces             CentralAir        PartialAssessed 
##              "integer"               "factor"              "logical" 
##        AssessedByState            CurrentLand            CurrentImpr 
##               "factor"              "integer"              "integer" 
##           CurrentTotal           PreviousLand           PreviousImpr 
##              "integer"              "integer"              "integer" 
##          PreviousTotal               NetTaxes          SpecialAssmnt 
##              "integer"              "numeric"              "numeric" 
##           OtherCharges             TotalTaxes                LotSize 
##              "numeric"              "numeric"              "numeric" 
##                Zoning1                Zoning2                Zoning3 
##               "factor"               "factor"               "factor" 
##                Zoning4           FrontageFeet         FrontageStreet 
##               "factor"              "numeric"               "factor" 
##          WaterFrontage            TIFDistrict          TaxSchoolDist 
##               "factor"              "integer"              "logical" 
##       AttendanceSchool       ElementarySchool           MiddleSchool 
##               "factor"               "factor"               "factor" 
##             HighSchool                   Ward  StateAssemblyDistrict 
##               "factor"              "integer"              "integer" 
##         RefuseDistrict              RefuseURL          PreviousLand2 
##               "factor"               "factor"              "integer" 
##          PreviousImpr2         PreviousTotal2          AlderDistrict 
##              "integer"              "integer"              "integer" 
##   AssessmentChangeDate            BlockNumber       BuildingDistrict 
##               "factor"              "integer"              "integer" 
##    CapitolFireDistrict            CensusTract         ConditionalUse 
##               "factor"              "numeric"              "integer" 
##            CouncilHold              DateAdded               DeedPage 
##              "integer"               "factor"              "integer" 
##        DeedRestriction             DeedVolume     ElectricalDistrict 
##              "integer"              "integer"              "integer" 
##      EnvHealthDistrict          ExemptionType           FireDistrict 
##              "integer"               "factor"              "integer" 
##             FloodPlain   FuelStorageProximity        HeatingDistrict 
##              "integer"              "integer"              "integer" 
##                  Holds    IllegalLandDivision      LandfillProximity 
##               "factor"              "integer"              "integer" 
##    LandfillRemediation               Landmark        LandscapeBuffer 
##               "factor"               "factor"              "integer" 
##    LocalHistoricalDist               LotDepth              LotNumber 
##               "factor"              "numeric"              "integer" 
##          LotteryCredit               LotType1               LotType2 
##              "integer"               "factor"               "factor" 
##               LotWidth                MCDCode NationalHistoricalDist 
##              "numeric"               "factor"              "integer" 
##       NeighborhoodDesc    NeighborhoodPrimary        NeighborhoodSub 
##               "factor"               "factor"               "factor" 
##       NeighborhoodVuln           NoiseAirport          NoiseRailroad 
##               "factor"              "integer"              "integer" 
##            NoiseStreet           ObsoleteDate        OwnerChangeDate 
##              "integer"              "logical"               "factor" 
##          OwnerOccupied       ParcelChangeDate             ParcelCode 
##               "factor"               "factor"               "factor" 
##          ParkProximity                Pending       PlanningDistrict 
##              "integer"              "logical"               "factor" 
##       PlumbingDistrict         PoliceDistrict           PoliceSector 
##              "integer"               "factor"              "integer" 
##          PreviousClass        PropertyUseCode       RailroadFrontage 
##               "factor"              "integer"               "factor" 
##       ReasonChangeImpr       ReasonChangeLand         SenateDistrict 
##              "logical"               "factor"              "integer" 
##     SupervisorDistrict                TifImpr                TifLand 
##              "integer"              "integer"              "integer" 
##                TifYear     TotalDwellingUnits    TrafficAnalysisZone 
##              "integer"              "integer"              "integer" 
##      TypeWaterFrontage               UWPolice            WetlandInfo 
##               "factor"               "factor"               "factor" 
##              ZoningAll      ZoningBoardAppeal    UrbanDesignDistrict 
##               "factor"              "integer"               "factor" 
##               HouseNbr              StreetDir             StreetName 
##              "integer"               "factor"               "factor" 
##             StreetType                   Unit               StreetID 
##               "factor"               "factor"              "integer" 
##           StormOutfall         FireDemandZone      FireDemandSubZone 
##               "factor"              "integer"              "integer" 
##     PropertyChangeDate    MaxConstructionYear                 XCoord 
##               "factor"              "integer"              "numeric" 
##                 YCoord            SHAPESTArea          SHAPESTLength 
##              "numeric"              "numeric"              "numeric"
## 'data.frame':    79022 obs. of  147 variables:
##  $ ï..OBJECTID           : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Parcel                : num  6.08e+10 6.08e+10 6.08e+10 6.08e+10 6.08e+10 ...
##  $ XRefParcel            : num  6.08e+10 6.08e+10 6.08e+10 6.08e+10 6.08e+10 ...
##  $ Address               : Factor w/ 79021 levels "1 Abilene Ct",..: 19267 19373 19465 19704 19796 20005 20119 20199 59274 59239 ...
##  $ DateParcelChanged     : Factor w/ 886 levels "1993-04-28T00:00:00.000Z",..: 690 690 690 690 690 690 849 690 713 690 ...
##  $ PropertyClass         : Factor w/ 4 levels "Agricultural",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ PropertyUse           : Factor w/ 305 levels "","0 unit Apartment",..: 273 273 273 273 273 273 273 273 273 273 ...
##  $ AssessmentArea        : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ AreaName              : Factor w/ 434 levels "2 units in Area 115",..: 303 303 303 303 303 303 303 303 303 303 ...
##  $ MoreThanOneBuild      : Factor w/ 2 levels "","Has more than one building": 1 1 1 1 1 1 1 1 1 1 ...
##  $ HomeStyle             : logi  NA NA NA NA NA NA ...
##  $ DwellingUnits         : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Stories               : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ YearBuilt             : int  1960 1959 1959 1962 1959 1962 1964 1965 1958 1959 ...
##  $ Bedrooms              : int  3 4 3 3 3 5 5 4 3 4 ...
##  $ FullBaths             : int  1 1 1 2 2 2 2 2 1 2 ...
##  $ HalfBaths             : int  2 1 1 1 0 0 0 0 1 0 ...
##  $ TotalLivingArea       : int  1371 1488 1290 1043 1386 1008 990 1076 1208 1603 ...
##  $ FirstFloor            : int  1371 1488 1290 1043 1386 1008 990 1076 1208 1603 ...
##  $ SecondFloor           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ThirdFloor            : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ AboveThirdFloor       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ FinishedAttic         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Basement              : int  1254 1008 1066 1008 1386 1008 935 1040 1208 1120 ...
##  $ FinishedBasement      : int  686 350 564 607 550 750 637 860 380 560 ...
##  $ ExteriorWall1         : Factor w/ 9 levels "","Aluminum/Vinyl",..: 9 9 2 9 2 9 9 2 2 9 ...
##  $ ExteriorWall2         : Factor w/ 9 levels "","Aluminum/Vinyl",..: 1 1 1 1 1 1 3 1 1 1 ...
##  $ Fireplaces            : int  1 1 0 0 1 1 1 0 0 1 ...
##  $ CentralAir            : Factor w/ 3 levels "","NO","YES": 3 3 3 3 3 3 3 3 3 2 ...
##  $ PartialAssessed       : logi  NA NA NA NA NA NA ...
##  $ AssessedByState       : Factor w/ 2 levels "","ASSESSED BY STATE": 1 1 1 1 1 1 1 1 1 1 ...
##  $ CurrentLand           : int  61700 66000 69700 67000 62700 57000 65100 58200 65200 62200 ...
##  $ CurrentImpr           : int  125000 114700 108700 126900 129500 112100 114000 137300 113900 125400 ...
##  $ CurrentTotal          : int  186700 180700 178400 193900 192200 169100 179100 195500 179100 187600 ...
##  $ PreviousLand          : int  61700 66000 69700 67000 62700 57000 65100 58200 65200 62200 ...
##  $ PreviousImpr          : int  116100 106100 97000 119400 120300 101000 105500 128000 102200 116500 ...
##  $ PreviousTotal         : int  177800 172100 166700 186400 183000 158000 170600 186200 167400 178700 ...
##  $ NetTaxes              : num  4138 3998 3945 4306 4267 ...
##  $ SpecialAssmnt         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherCharges          : num  0 0 568 0 0 ...
##  $ TotalTaxes            : num  4138 3998 4513 4306 4267 ...
##  $ LotSize               : num  14270 14718 18867 14984 13334 ...
##  $ Zoning1               : Factor w/ 44 levels "A","AP","CC",..: 26 26 26 26 26 26 26 26 26 26 ...
##  $ Zoning2               : Factor w/ 53 levels "","CN","HIS-FS",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Zoning3               : Factor w/ 24 levels "","HIS-L","HIS-MH",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Zoning4               : Factor w/ 4 levels "","W","WP-17",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ FrontageFeet          : num  85.1 65.8 66 64.5 81.7 ...
##  $ FrontageStreet        : Factor w/ 2808 levels "","Aaron Ct",..: 1926 1926 1926 1926 1926 1926 1926 1926 1991 1991 ...
##  $ WaterFrontage         : Factor w/ 2 levels "NO","YES": 1 1 1 1 1 1 1 1 1 1 ...
##  $ TIFDistrict           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ TaxSchoolDist         : logi  NA NA NA NA NA NA ...
##  $ AttendanceSchool      : Factor w/ 9 levels "","De Forest",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ ElementarySchool      : Factor w/ 31 levels "","Allis","Assigned",..: 13 13 13 13 13 13 13 13 13 13 ...
##  $ MiddleSchool          : Factor w/ 14 levels "","Assigned",..: 13 13 13 13 13 13 13 13 13 13 ...
##  $ HighSchool            : Factor w/ 6 levels "","East","Lafollette",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ Ward                  : int  95 95 95 95 95 95 95 95 95 95 ...
##  $ StateAssemblyDistrict : int  78 78 78 78 78 78 78 78 78 78 ...
##  $ RefuseDistrict        : Factor w/ 22 levels "00","01A","01B",..: 9 9 9 9 9 9 9 9 9 9 ...
##  $ RefuseURL             : Factor w/ 12 levels "","http://www.cityofmadison.com/streets/documents/friA.pdf",..: 10 10 10 10 10 10 10 10 10 10 ...
##  $ PreviousLand2         : int  61700 66000 69700 67000 62700 57000 65100 58200 65200 62200 ...
##  $ PreviousImpr2         : int  112600 102700 93700 115700 116700 77600 102200 124300 98900 113000 ...
##  $ PreviousTotal2        : int  174300 168700 163400 182700 179400 134600 167300 182500 164100 175200 ...
##  $ AlderDistrict         : int  20 20 20 20 20 20 20 20 20 20 ...
##  $ AssessmentChangeDate  : Factor w/ 714 levels "","1981-01-04T00:00:00.000Z",..: 632 632 632 632 632 632 632 632 632 632 ...
##  $ BlockNumber           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ BuildingDistrict      : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ CapitolFireDistrict   : Factor w/ 2 levels " - ","1 - Downtown Fire Safety District": 1 1 1 1 1 1 1 1 1 1 ...
##  $ CensusTract           : num  5.02 5.02 5.02 5.02 5.02 5.02 5.02 5.02 5.02 5.02 ...
##  $ ConditionalUse        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ CouncilHold           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ DateAdded             : Factor w/ 1305 levels "","1989-02-14T00:00:00.000Z",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ DeedPage              : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ DeedRestriction       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ DeedVolume            : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ElectricalDistrict    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ EnvHealthDistrict     : int  31 31 31 31 31 31 31 31 31 31 ...
##  $ ExemptionType         : Factor w/ 46 levels " - ","1 - State Property",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ FireDistrict          : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ FloodPlain            : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ FuelStorageProximity  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ HeatingDistrict       : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ Holds                 : Factor w/ 4526 levels "","HOLD:  PLAJM @ 65 Buttonwood Ct",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ IllegalLandDivision   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ LandfillProximity     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ LandfillRemediation   : Factor w/ 4 levels "","IN PROXIMITY TO KNOWN LANDFILL",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Landmark              : Factor w/ 3 levels "","A","L": 1 1 1 1 1 1 1 1 1 1 ...
##  $ LandscapeBuffer       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ LocalHistoricalDist   : Factor w/ 6 levels " - ","1 - Mansion Hill Historic District",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LotDepth              : num  0 0 0 0 0 0 0 0 150 0 ...
##  $ LotNumber             : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ LotteryCredit         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ LotType1              : Factor w/ 5 levels "1 - Regular",..: 1 1 2 2 1 2 2 1 1 2 ...
##  $ LotType2              : Factor w/ 7 levels "0 - No Exception",..: 2 1 1 1 1 1 1 2 1 1 ...
##  $ LotWidth              : num  0 0 0 0 0 0 0 0 93 0 ...
##  $ MCDCode               : Factor w/ 1 level "MADC": 1 1 1 1 1 1 1 1 1 1 ...
##  $ NationalHistoricalDist: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ NeighborhoodDesc      : Factor w/ 11 levels "Allied Drive",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ NeighborhoodPrimary   : Factor w/ 17 levels "0 - No description entered",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ NeighborhoodSub       : Factor w/ 3 levels "0 - No description entered",..: 1 1 1 1 1 1 1 1 1 1 ...
##   [list output truncated]

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##        0   144600   211500   318629   298000 97320000

Univariate Plots 1 & 2

The plots and summary above show the distribution of the feature of interest, CurrentTotal, which represents the total assessed value of a property (land plus improvements). As the first histogram shows, the distribution is highly right-skewed, with a cluster of values around zero and a maximum of $97,320,000. A log transformation reveals a distribution that peaks around $200,000.
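The log transformation described above can be sketched in base R as follows. This is a minimal illustration rather than the code behind the plots: `current_total` is a synthetic stand-in for `assessor_data$CurrentTotal`, and dropping the zero values before taking logs is my assumption about how they were handled.

```r
# Synthetic stand-in for assessor_data$CurrentTotal (NOT the real data):
# 50 zero-valued parcels plus 950 right-skewed positive values.
set.seed(1)
current_total <- c(rep(0, 50),
                   round(rlnorm(950, meanlog = log(211500), sdlog = 0.6)))

# log10(0) is -Inf, so keep only strictly positive values (an assumption).
positive_total <- current_total[current_total > 0]
log_total <- log10(positive_total)

# Bin the log values without drawing; drop plot = FALSE to see the histogram.
h <- hist(log_total, breaks = 30, plot = FALSE)

# Midpoint of the tallest bin, converted back to dollars.
peak_value <- 10^h$mids[which.max(h$counts)]
```

Working on the log10 scale keeps the axis readable in dollar terms, since each unit corresponds to a factor of ten in assessed value.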

##                         summary(assessor_data$PropertyUse)
## Single family                                        46827
## Condominium                                          17017
## 2 Unit                                                3280
## Vacant                                                2983
## 4 unit Apartment                                       938
## Commercial exempt                                      907
## Agricultural                                           748
## 3 unit Apartment                                       577
## Condominium -other                                     573
## Office 2 sty or lg.                                    301
## Warehouse & office                                     248
## 8 unit Apartment                                       239
## Store 1 sty sm                                         214
## Condominium-notation                                   196
## Apartment & store                                      192
## Manufacturing                                          142
## Office - 1 story                                       134
## 6 unit Apartment                                       133
## 5 unit Apartment                                       123
## Warehouse 1 story                                      120
## M-1 vacant                                             111
## Condominium -office                                     95
## Shop center neighbor                                    88
## Restaurant                                              73
## Condominium -apt                                        72
## Condo -store/retail                                     69
## Pud vacant                                              68
## Bank, s & l                                             60
## C-2 parking lot                                         60
## Shop, 1 story sm.                                       60
## Tavern                                                  50
## Gas & store                                             49
## Store 1 sty lg dept.                                    49
## C-2 vacant                                              48
## Condominium-Warehouse                                   45
## 7 unit Apartment                                        44
## Apartment & office                                      44
## C-1 vacant                                              44
## C-3l vacant                                             42
## Shop & office                                           41
## Other                                                   39
## Store-warehse 1 sty.                                    39
## Garage, repair                                          38
## Rest drive-in w/seat                                    38
## Hotel                                                   37
## Medical clinic                                          32
## Warehouse, mini type                                    32
## C-3 vacant                                              31
## Day care center                                         31
## Gar new car & repair                                    31
## Office converted sm.                                    31
## Rest. w/bar & liquor                                    31
## Frat & sorority lg.                                     29
## Commercial Exempt Condo                                 28
## Warehouse, small                                        28
## M-1 parking lot                                         27
## Restaurant & apts.                                      27
## Store & office small                                    26
## Store & shop                                            26
## Store 2 sty small                                       26
## 10 unit Apartment                                       24
## Rooming house                                           23
## Tavern & apartment                                      23
## 24 unit Apartment                                       22
## C-3 parking lot                                         21
## Motel                                                   21
## 0 unit Apartment                                        20
## Office & retail                                         20
## Gar used car & fix                                      19
## 12 unit Apartment                                       18
## 16 unit Apartment                                       18
## 30 unit Apartment                                       18
## Nursing home                                            18
## Rpsm vacant                                             18
## 9 unit Apartment                                        17
## 13 unit Apartment                                       16
## 18 unit Apartment                                       16
## 20 unit Apartment                                       16
## 40 unit Apartment                                       16
## 72 unit Apartment                                       16
## Restaurant & office                                     16
## 14 unit Apartment                                       15
## Apartments & rooms                                      15
## 36 unit Apartment                                       14
## Office insur type lg                                    14
## Restaurant & store                                      14
## 11 unit Apartment                                       13
## 48 unit Apartment                                       13
## C-1 parking lot                                         13
## Garage, steel sm.                                       13
## 60 unit Apartment                                       12
## 64 unit Apartment                                       12
## C-3l parking lot                                        12
## Golf course                                             12
## Grocer, large                                           12
## Office medical                                          12
## Shop & warehouse                                        12
## Store, Big Box                                          12
## Shop & house                                            11
## (Other)                                                664

Univariate Plots 3 & 4

These plots explore the first explanatory feature of interest, PropertyUse. The summary of PropertyUse lists the 99 most frequent categories and lumps the rest under "(Other)"; the factor itself has 305 levels. I am interested in single-family homes and condominiums, which are the top two categories. The first plot, which includes all values, is difficult to read. The second, which plots only categories with more than 1,000 observations, summarizes the frequencies of the top four categories.
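Filtering to the frequent categories could look roughly like the sketch below. The counts are copied from the summary above for six of the categories; the vector `property_use` is rebuilt from those counts for illustration and stands in for `assessor_data$PropertyUse`.

```r
# Rebuild a small PropertyUse vector from six counts shown in the summary.
property_use <- factor(c(
  rep("Single family", 46827),
  rep("Condominium", 17017),
  rep("2 Unit", 3280),
  rep("Vacant", 2983),
  rep("4 unit Apartment", 938),
  rep("Commercial exempt", 907)
))

# Tabulate, order from most to least frequent, and keep counts above 1000.
use_counts <- sort(table(property_use), decreasing = TRUE)
common_use <- use_counts[use_counts > 1000]

# barplot(common_use) would draw the reduced chart.
```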

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0     862    1260    1267    1709   18144
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    1018    1316    1360    1696    9342
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     157    1100    1375    1493    1745    9342

Univariate Plots 5, 6, 7

The histograms above show the distribution of TotalLivingArea. The full raw dataset contains a large number of zeros; these are largely eliminated after subsetting the data to single-family homes and condominiums.

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##        0     3267     7800    23274    11250 24549987

Univariate Plots 8, 9

The LotSize feature, which corresponds to the area of the property lot in square feet, is heavily skewed right.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   2.000   3.000   3.436   3.000 720.000

Univariate Plots 10, 11

The distribution of Bedrooms is also right-skewed, with the interquartile range falling between 2 and 3 bedrooms. Restricting the range of the histogram shows this distribution more clearly.
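As a small sketch of how the interquartile range and a restricted plotting range can be obtained, assuming a synthetic `bedrooms` vector and an arbitrary cutoff of 10 (neither is taken from the original analysis):

```r
# Synthetic bedroom counts with one extreme outlier (NOT the real data).
bedrooms <- c(0, 2, 2, 3, 3, 3, 3, 4, 5, 720)

# First and third quartiles bound the interquartile range.
bedroom_iqr <- quantile(bedrooms, probs = c(0.25, 0.75))

# Keep only plausible values so the histogram is not dominated by the outlier.
# The cutoff of 10 is a hypothetical choice for illustration.
bedrooms_restricted <- bedrooms[bedrooms <= 10]

# hist(bedrooms_restricted, breaks = 0:10) would draw the restricted histogram.
```

Filtering the vector changes which observations are counted, whereas setting axis limits on a plot only changes what is displayed; either approach makes the bulk of the distribution visible.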

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.000   2.000   1.912   2.000 569.000

Univariate Plots 12, 13

The distribution of bathrooms is also highly skewed right. The interquartile range is between 1 and 2 full bathrooms.

Univariate Plots 14, 15

The histograms for HalfBaths indicate that this feature lacks variability as well, except for some outliers.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    1916    1959    1576    1988    2017

Univariate Plots 16, 17

The first plot of the raw data shows that there are over 15,000 observations with YearBuilt = 0. Setting the minimum value to 1837 and adding tick marks every 10 years shows the distribution of YearBuilt more clearly. Of note is the presence of housing booms and busts over time, most clearly seen in the run-up to the housing crisis of 2008, followed by the Great Recession.
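The range restriction and decade tick marks might be set up along these lines. The `year_built` values are synthetic, and the base R plotting calls in the comments are one possible way to draw the figure, not necessarily how the plots above were made.

```r
# Synthetic YearBuilt values, including placeholder zeros (NOT the real data).
year_built <- c(rep(0L, 5), 1837L, 1916L, 1959L, 1960L, 1988L, 2006L, 2017L)

# Count the placeholder zeros, then drop anything before the 1837 minimum.
n_zero <- sum(year_built == 0)
valid_years <- year_built[year_built >= 1837]

# Tick positions every 10 years across the plotted range.
decade_ticks <- seq(1837, 2017, by = 10)

# To draw:
# hist(valid_years, breaks = seq(1837, 2027, by = 10), xaxt = "n")
# axis(1, at = decade_ticks)
```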

##                             Allis         Assigned           Chavez 
##             6115             2274              164             2781 
##        Crestwood         Elvehjem          Emerson             Falk 
##             3365             3193             3104             2686 
## Franklin-Randall         Glendale          Gompers        Hawthorne 
##             9109             2564             2522             1726 
##           Huegel          Kennedy        Lake View Lapham-Marquette 
##             2495             4089              607             5036 
##          Leopold        Lindbergh           Lowell          Mendota 
##              743              790             1944             1618 
##  Midvale-Lincoln             Muir            Olson    Orchard Ridge 
##             3636             2060             2556             1690 
##         Sandburg           Schenk        Shorewood         Stephens 
##             1796             2335               83             2940 
##          Thoreau To be determined         Van Hise 
##             2360              273             2368

Univariate Plots 18

This plot shows the number of observations by elementary school. Of note is the large number of observations that are not assigned to any school. Investigation showed that in many cases these are homes located in Madison that fall in the school districts of adjacent communities.

Data Cleaning

Initial plots and tables show that, in order to realistically analyze the effect of property features on assessed value, we need to subset the data, removing commercial and large multi-family properties. In addition, the many missing values for elementary, middle, and high school need to be filled with the AttendanceSchool value where possible.
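A minimal sketch of these two cleaning steps is shown below. The data frame `assessor_demo` is a hypothetical four-row example, the choice to keep exactly the "Single family" and "Condominium" categories is an assumption based on the commentary above, and only ElementarySchool is filled here; the same pattern would apply to MiddleSchool and HighSchool.

```r
# Hypothetical miniature of the assessor data (NOT the real records).
assessor_demo <- data.frame(
  PropertyUse      = c("Single family", "Condominium",
                       "Warehouse 1 story", "Single family"),
  ElementarySchool = c("Huegel", "", "", ""),
  AttendanceSchool = c("", "Sun Prairie", "De Forest", ""),
  stringsAsFactors = FALSE
)

# Step 1: keep only the residential categories of interest (assumed list).
residential <- subset(assessor_demo,
                      PropertyUse %in% c("Single family", "Condominium"))

# Step 2: where the school is blank and a fallback exists, copy the fallback.
needs_fill <- residential$ElementarySchool == "" &
  residential$AttendanceSchool != ""
residential$ElementarySchool[needs_fill] <-
  residential$AttendanceSchool[needs_fill]
```

Rows with neither value stay blank, which matches the "where possible" qualification. If the columns are stored as factors, as in the `str()` output above, they would need converting with `as.character()` first so that new values can be assigned.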

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0  172600  221500  250375  294700 4500000

Univariate Plots 19, 20

Compared to the raw data, the plots above, which are drawn from data subset to include only single-family and condominium properties, show a much tighter, right-skewed distribution.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    1100    1375    1492    1745    9342

Univariate Plots 21, 22

In previous iterations of this plot, I subset the data to exclude TotalLivingArea = 0. This plot therefore does not show much change from plots 6 and 7 above.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    4318    7998    8001   10725  873073

Univariate Plots 23, 24

Restricting the dataset to single-family homes and condominiums results in a less right-skewed distribution.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    2.00    3.00    2.98    3.00   12.00
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.000   2.000   1.737   2.000   8.000
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.4452  1.0000  4.0000

Univariate Plots 25, 26, 27

Restricting the dataset to Single Family Homes/Condominiums results in the removal of outliers such as the observation with 720 bedrooms.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1837    1952    1971    1970    1996    2017

Univariate Plots 28

This plot confirms that using the limits set in plot 17 results in a similar distribution on the subsetted data.

##                  Allis               Assigned                 Chavez 
##                   1995                     81                   2530 
##              Crestwood              De Forest               Elvehjem 
##                   2894                    120                   2871 
##                Emerson                   Falk       Franklin-Randall 
##                   2414                   1963                   4665 
##               Glendale                Gompers              Hawthorne 
##                   1813                   1720                   1234 
##                 Huegel                Kennedy              Lake View 
##                   2163                   3358                    533 
##       Lapham-Marquette                Leopold              Lindbergh 
##                   2410                    525                    747 
##                 Lowell             Mc Farland                Mendota 
##                   1431                    341                   1413 
## Middleton/Cross Plains        Midvale-Lincoln                   Muir 
##                   1703                   3032                   1736 
##                  Olson          Orchard Ridge               Sandburg 
##                   1966                   1532                   1353 
##                 Schenk              Shorewood               Stephens 
##                   1883                     51                   2422 
##            Sun Prairie                Thoreau       To be determined 
##                    590                   1970                    181 
##               Van Hise                 Verona               Waunakee 
##                   1810                    409                    331

Univariate Plots 29

In addition to subsetting the data, I addressed the issue of missing schools, since location (as indicated by attendance school) is an explanatory feature of interest in this analysis. This plot reflects the updated value for “ElementarySchool”.

Univariate Analysis

What is the structure of your dataset?

The dataset has 79,022 observations and 147 variables. It contains detailed information about the assessed properties as well as the assessed values for the current and previous years. For this analysis, I restricted the ‘PropertyUse’ variable to ‘Single Family’ or ‘Condominium’.

What is/are the main feature(s) of interest in your dataset?

The main features of interest are the assessed values for land and improvements, for the current and previous years: “CurrentLand”, “CurrentImpr”, “CurrentTotal”, “PreviousLand”, “PreviousImpr”, “PreviousTotal”. For the purpose of this particular analysis, I focus on “CurrentTotal”.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I think that “LotSize”, “TotalLivingArea”, “PropertyUse”, “Bedrooms”, “FullBaths”, “HalfBaths”, and “YearBuilt” will support the investigation into assessed property value. Location is also important, but I will have to investigate to see which of the following location variables best predicts assessed value: “ElementarySchool”, “MiddleSchool”, “HighSchool”, “Ward”, “StateAssemblyDistrict”, “AlderDistrict”, “CensusTract”.

Did you create any new variables from existing variables in the dataset?

Yes. To generate a smoother distribution of prices, I applied a log transformation. I also created a new variable called “HomeAge”, computed as the assessment year (2017) minus “YearBuilt”, plus 1.
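As a rough sketch of these two derived variables in base R: the column name LogCurrentTotal and the choice of base 10 are assumptions (the report does not state the log base), and the toy values are illustrative. Because CurrentTotal has a minimum of 0 in the summaries above, zero values are set to NA rather than logged.

```r
# Toy rows; the second has an assessed value of 0 to show the edge case.
homes <- data.frame(
  CurrentTotal = c(186700, 0, 178400),
  YearBuilt    = c(1960, 1959, 2017)
)

assessment_year <- 2017

# Log transform (base 10 is an assumption). log10(0) is -Inf, so zeros
# are replaced with NA instead of being transformed.
homes$LogCurrentTotal <- ifelse(homes$CurrentTotal > 0,
                                log10(homes$CurrentTotal),
                                NA_real_)

# HomeAge counts the build year itself as year 1.
homes$HomeAge <- assessment_year - homes$YearBuilt + 1

homes
```

With this definition a home built in the assessment year has a HomeAge of 1 rather than 0.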

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

In the raw data, Total Living Area has a sizeable number of zeros. There appear to be parcels that correspond to parking/storage for condos. To simplify the analysis, these are dropped.
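Dropping these zero-area parcels might look like the following in base R. The toy rows are hypothetical; counting the zeros first gives a record of how many observations the filter removes.

```r
# Toy rows; the middle one stands in for a condo parking/storage parcel.
homes <- data.frame(
  PropertyUse     = c("Single family", "Condominium", "Condominium"),
  TotalLivingArea = c(1371, 0, 990)
)

# Record how many parcels have no living area before dropping them.
n_zero <- sum(homes$TotalLivingArea == 0, na.rm = TRUE)

# which() also discards rows where TotalLivingArea is NA.
homes <- homes[which(homes$TotalLivingArea > 0), ]

n_zero
nrow(homes)
```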

Bivariate Plots Section

## 'data.frame':    58190 obs. of  9 variables:
##  $ CurrentTotal    : int  186700 180700 178400 193900 192200 169100 179100 195500 179100 187600 ...
##  $ PropertyUse     : chr  "Single family" "Single family" "Single family" "Single family" ...
##  $ YearBuilt       : int  1960 1959 1959 1962 1959 1962 1964 1965 1958 1959 ...
##  $ TotalLivingArea : int  1371 1488 1290 1043 1386 1008 990 1076 1208 1603 ...
##  $ Bedrooms        : int  3 4 3 3 3 5 5 4 3 4 ...
##  $ FullBaths       : int  1 1 1 2 2 2 2 2 1 2 ...
##  $ ElementarySchool: Factor w/ 36 levels "Allis","Assigned",..: 13 13 13 13 13 13 13 13 13 13 ...
##  $ MiddleSchool    : Factor w/ 19 levels "Assigned","Black Hawk",..: 16 16 16 16 16 16 16 16 16 16 ...
##  $ HighSchool      : Factor w/ 11 levels "De Forest","East",..: 5 5 5 5 5 5 5 5 5 5 ...

Bivariate Plots 1

The ggpairs plot above summarizes the relationships between the features of interest in this dataset. In particular, we see a relatively strong positive correlation between TotalLivingArea and CurrentTotal, and to a lesser extent a correlation between the number of full baths and the CurrentTotal.

Feature of Interest by other features

Bivariate Plots 2

The plot above shows that the mean assessed value for single family homes is slightly greater than the mean assessed value for condominiums. Moreover, the upper tail of the distribution for single family homes appears to be longer than that of condominiums.

Bivariate Plots 3

The plot above shows that, as is to be expected, condominiums do not have values for LotSize. The bulk of the distribution of LotSize for Single Family Homes lies between 1,000 and 10,000 square feet.

Bivariate Plots 4

As is to be expected, the plot above shows that the distribution of TotalLivingArea is wider for Single Family Homes than for condominiums.

Bivariate Plots 5

The box plots above show that in addition to the wider distribution for Single Family Homes, condominiums have a lower median TotalLivingArea, which is intuitive.

Bivariate Plots 6

The histogram above shows that, as one might expect, the housing stock of single family homes is older than that of condominiums. The few condominiums with YearBuilt values in the late 19th and early 20th centuries were likely historic buildings (e.g. hotels) that were recently converted to condominiums.

Bivariate Plots 7

These box plots show that the distribution of assessed value for single family homes contains both more outliers, and a tighter inter-quartile range. Moreover, as the histogram above indicated, the median and mean values for single family homes are greater than for condominiums.

Bivariate Plots 8

The above box plots show that for single family homes the median and mean land value is greater than for condominiums. Interestingly the interquartile range is greater for condominiums.

Bivariate Plots 9

The box plots above show that the mean and median values for the assessed values of single family and condominium structures are much closer. Again, we see that the interquartile range for condominiums is much greater.

Bivariate Plots 10

The histogram above shows how the range of total living area varies by elementary school. It also indicates which elementary schools have a greater number of single family homes and condominiums.

Bivariate Plots 11

The histogram above shows that, as one would expect, the central tendency of total living area increases with the number of bedrooms.

## [1] -0.04254596

Bivariate Plots 12

The above plot and correlation coefficient (-0.043) show that there appears to be no relationship between when a house was built and its total assessed value.

## [1] 0.7423409

Bivariate Plots 13

The plot and correlation coefficient above (0.742) indicate that there is a relatively strong positive relationship between the total living area of a home and its total assessed value.

## [1] 0.2503815
## Source: local data frame [2 x 2]
## 
##     PropertyUse         COR
## 1   Condominium -0.01483127
## 2 Single family  0.18849254

Bivariate Plots 14

There does not appear to be a strong relationship between lot size and assessed value. However, this is likely because many condominiums have no lot size, or a lot size of zero. Filtering out condos may reveal a stronger relationship for single family homes.

## Source: local data frame [2 x 2]
## 
##     PropertyUse         COR
## 1   Condominium -0.01483127
## 2 Single family  0.18849254

Bivariate Plots 15

Subsetting this to single family homes only appears to weaken the correlation between lot size and total assessed value.
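The overall and per-group correlations reported above can be reproduced along these lines. This sketch uses base R on made-up values; the report's own output format ("local data frame") suggests dplyr was used, so this is an equivalent formulation rather than the original code.

```r
# Toy data: four condominiums and four single family homes (values made up).
homes <- data.frame(
  PropertyUse  = rep(c("Condominium", "Single family"), each = 4),
  LotSize      = c(0, 0, 1200, 900, 4300, 8000, 10700, 6500),
  CurrentTotal = c(150000, 210000, 180000, 160000,
                   172600, 221500, 294700, 250000)
)

# Pearson correlation across all observations.
overall <- cor(homes$LotSize, homes$CurrentTotal)

# Pearson correlation within each property type.
by_use <- sapply(split(homes, homes$PropertyUse),
                 function(d) cor(d$LotSize, d$CurrentTotal))

overall
by_use
```

Note that if every condominium in a group had LotSize equal to 0, cor() would return NA for that group, because the correlation is undefined when one variable has zero variance.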

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

The most important explanatory feature that I wanted to examine was PropertyUse. It is reasonable to believe that condominiums and single family detached homes (“SFDHs”) represent qualitatively different markets. As such, I investigated how the distribution of other features of interest varied across these groups. The most noteworthy finding was that for the vast majority of condominiums, LotSize=0. This has important implications for the inclusion of LotSize in any regression, since it strongly covaries with PropertyUse. The distribution of the main feature CurrentTotal was wider for condominiums than for SFDHs. When I looked at the distribution of the components of CurrentTotal (CurrentLand and CurrentImpr), both of these appeared to have larger spreads for condominiums than for SFDHs.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

There is a relationship between the number of bedrooms and the TotalLivingArea.

What was the strongest relationship you found?

Total Living Area and CurrentTotal.

Multivariate Plots Section

## Source: local data frame [2 x 2]
## 
##     PropertyUse       COR
## 1   Condominium 0.6849256
## 2 Single family 0.7513470

Multivariate Plots 1

The positive correlation between total living area and total assessed value is stronger for single family homes than for condominiums.

## Source: local data frame [36 x 2]
## 
##    ElementarySchool       COR
## 1             Allis 0.6226428
## 2          Assigned 0.5649764
## 3            Chavez 0.6978508
## 4         Crestwood 0.7568220
## 5         De Forest 0.2401669
## 6          Elvehjem 0.7539876
## 7           Emerson 0.6556453
## 8              Falk 0.8384986
## 9  Franklin-Randall 0.8346005
## 10         Glendale 0.7017174
## ..              ...       ...

Multivariate Plots 2

The correlations between living area and assessed value vary considerably by elementary school. The general trend appears to be that the more desirable the location (as determined by elementary school), the higher the correlation between living area and assessed value. The exclusive enclave of Shorewood has the highest correlation.

## Source: local data frame [19 x 2]
## 
##              MiddleSchool        COR
## 1                Assigned  0.5649764
## 2              Black Hawk  0.6918834
## 3                Cherokee  0.7674481
## 4               De Forest  0.2401669
## 5                Hamilton  0.7927054
## 6               Jefferson  0.7590353
## 7              Mc Farland  0.6117191
## 8  Middleton/Cross Plains  0.8547688
## 9                O'Keeffe  0.8107192
## 10   Opt Cherokee/Hamiltn  0.6970512
## 11     Opt Toki/Jefferson  0.8301759
## 12                Sennett  0.6598879
## 13                Sherman  0.6917482
## 14            Sun Prairie -0.1236515
## 15       To be determined  0.3007001
## 16                   Toki  0.7939027
## 17                 Verona  0.6482604
## 18               Waunakee  0.8270424
## 19             Whitehorse  0.6160075

Multivariate Plots 3

The correlations for middle school areas are necessarily less extreme in range, as we aggregate up from elementary schools.

## Source: local data frame [11 x 2]
## 
##                HighSchool        COR
## 1               De Forest  0.2401669
## 2                    East  0.7064427
## 3              Lafollette  0.6377896
## 4              Mc Farland  0.6117191
## 5                Memorial  0.7612938
## 6  Middleton/Cross Plains  0.8547688
## 7                Optional  0.6428767
## 8             Sun Prairie -0.1236515
## 9                  Verona  0.6482604
## 10               Waunakee  0.8270424
## 11                   West  0.7683711

Multivariate Plots 4

The highest level of aggregation shows that there are still distinct differences across broad parts of the city of Madison and surrounding towns, despite the fact that the overall distribution is compressed.

Multivariate Plots 5

The plots show the relationship between living area and assessed value with outliers (below 1st percentile and above the 99th percentile) removed.
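A percentile trim of this kind could be written as below. The report does not say which variable the 1st/99th percentile limits were applied to, so applying them to both TotalLivingArea and CurrentTotal is an assumption, as are the helper name within_pct and the evenly spaced toy data.

```r
# Toy data: 200 evenly spaced observations (not real assessments).
homes <- data.frame(
  TotalLivingArea = seq(500, 5475, by = 25),
  CurrentTotal    = seq(100000, 896000, by = 4000)
)

# TRUE for values lying between the lower and upper percentiles (inclusive).
within_pct <- function(x, lower = 0.01, upper = 0.99) {
  limits <- quantile(x, probs = c(lower, upper), na.rm = TRUE)
  !is.na(x) & x >= limits[[1]] & x <= limits[[2]]
}

# Assumption: a row is kept only if both variables fall inside their limits.
keep <- within_pct(homes$TotalLivingArea) & within_pct(homes$CurrentTotal)
trimmed <- homes[keep, ]

nrow(homes)
nrow(trimmed)
```

Trimming on both variables removes more rows than trimming on one when the extreme values do not coincide, which is worth keeping in mind when comparing counts before and after.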

Multivariate Plots 6

The plot above focuses on the relationship between total living area and assessed value for Madison East High School only.

Multivariate Plots 7

The plot above focuses on the relationship between total living area and assessed value for Madison West High School only.

Multivariate Plots 8

The plot above focuses on the relationship between total living area and assessed value for Madison La Follette High School only.

Multivariate Plots 9

The plot above focuses on the relationship between total living area and assessed value for Madison Memorial High School only.

Multivariate Plots 10

The plot above focuses on the relationship between total living area and assessed value for Middleton/Cross Plains High School only.

Multivariate Plots 11

The plot above focuses on the relationship between total living area and assessed value for condominiums only.

Multivariate Plots 12

The plot above focuses on the relationship between total living area and assessed value for Single family homes only.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

I was particularly interested in the relationship between house size, location, and assessed value. By adding a trend line to the scatter plots, I was able to find that different neighborhoods had different slopes, indicating that depending on where a house was located, the relationship between size and assessed value was stronger or weaker.

Were there any interesting or surprising interactions between features?

In addition to the fact that different neighborhoods have different relationships between house size and assessed value, different neighborhoods also showed varying degrees of spread in the data. In other words, the variation of assessed value conditional on house size was greater for some neighborhoods than for others.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.


Final Plots and Summary

Plot One

Description One

The plot above breaks out the relationship between total living area and assessed value by the number of bedrooms. We see the general increase in total living area corresponds to an increasing number of bedrooms, as well as a much steeper relationship between total living area and assessed value for one bedroom homes.

Plot Two

Description Two

This plot shows a clear difference by high school in the assessed value, conditional on total living area.
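One way to put numbers on this difference is to fit a separate straight line per high school and compare the slopes. The sketch below does this with base R's lm() on invented values for two schools; the slopes it produces describe the toy data only, not the Madison assessments, and the report itself stops at visual trend lines.

```r
# Toy data: five homes per school, constructed to lie exactly on a line.
homes <- data.frame(
  HighSchool      = rep(c("East", "West"), each = 5),
  TotalLivingArea = rep(c(1000, 1200, 1500, 1800, 2200), times = 2),
  CurrentTotal    = c(150000, 172000, 205000, 238000, 282000,
                      200000, 240000, 300000, 360000, 440000)
)

# Fit CurrentTotal ~ TotalLivingArea within each school and keep the slope,
# i.e. the change in assessed value per additional square foot.
slopes <- sapply(split(homes, homes$HighSchool), function(d) {
  fit <- lm(CurrentTotal ~ TotalLivingArea, data = d)
  coef(fit)[["TotalLivingArea"]]
})

slopes
```

A single model with an interaction term, lm(CurrentTotal ~ TotalLivingArea * HighSchool), would estimate the same per-school slopes while also allowing a formal comparison between them.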

Plot Three

Description Three

The distribution for Condominiums is bifurcated in a way that you don’t see for single family detached homes. This reflects the changing market for condos, where we see new luxury units at the high end of the value distribution, lower value units, but not as much in the middle of the distribution. Overlaying the scatter plots shows this clearly.

Reflection

I found that cleaning and subsetting the data was the most challenging aspect of this project. After an initial exploration, it was clear that the dataset included property types that I was not interested in examining, as well as wrinkles, such as the fact that school values were missing for properties that were not in the Madison school district. Further work could focus on developing an explicit model for assessed value based on the variables above.